Emergence of information transmission in a prebiotic RNA reactor 
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A poorly understood step in the transition from a cliemical to a biological world is the emergence 
of self-replicating molecular systems. We study how a precursor for such a replicator might arise in a 
hydrothermal RNA reactor, which accumulates longer sequences from unbiased monomer influx and 
random ligation. In the reactor, intra- and inter-molecular basepairing locally protects from random 
cleavage. By analyzing stochastic simulations, we find temporal sequence correlations that constitute 
a signature of information transmission, weaker but of the same form as in a true replicator. 



The RNA world theory [T| posits that the first infor- 
mation carrying and catalytically active molecules at the 
origin of life were RNA-Hke polynucleotides [2] . This idea 
is empirically supported by the discovery of ribozymes, 
which perform many different reactions [3] , among them 
the basic template-directed ligation and polymerization 
steps [H [S] necessary for replicating RNA. However, 
a concrete scenario how a self-replicating RNA system 
could have arisen spontaneously from a pool of random 
polynucleotides is still lacking. Physical effects may have 
facilitated this step, as is believed to be the case in other 
transitions of prebiotic evolution [5]. 

From the perspective of information, an RNA repli- 
cator transmits sequence information from molecule to 
molecule, such that the information survives even when 
the original carrier molecules are degraded, for instance 
due to hydrolytic cleavage Rephrased in these terms, 
the problem of spontaneous emergence of an RNA repli- 
cator ISl [S] becomes a question of a path from a short 
term to a lasting sequence memory. Either this transition 
occurred as a single unlikely step or as a more gradual, 
multi-step transition. Here, we explore a scenario of the 
latter type, based only on simple physico-chemical pro- 
cesses, see Fig. [l] (i) random ligation of RNA molecules, 
e.g. in a hydrothermal "RNA reactor", where polynu- 
cleotides are accumulated by thermophoresis [TD], (ii) 
folding and hybridization of RNA strands, and (iii) pref- 
erential cleavage of single- rather than double-stranded 
RNA segments [7|- Using extensive computer simula- 
tions and theoretical analysis, we study the behavior that 
emerges when these processes are combined. 

Clearly, the preferential cleavage at unpaired bases ef- 
fectively creates a selection pressure for base pairing in 
the reactor. We find that this effect increases the com- 
plexity of RNA structures in the sequence pool, which 
may favor the emergence of ribozymes. The underlying 
sequence bias also extends the expected lifetime of se- 
quence motifs in the finite pool. Interestingly, we find 
that correlations between motifs persist even longer than 
expected. This memory effect is associated with in- 
formation transmission via hybridization. Intriguingly, 
these correlations have the same statistical signature as 



templated self-replication, only weaker. In this sense, 
the RNA reactor could constitute a stepping-stone from 
which a true RNA replicator could emerge, e.g., assisted 
by a primitive ribozyme catalyzing template-directed 
synthesis. 

RNA reactor. — As illustrated in Fig. [TJ we envisage 
an open reaction volume V under non-equilibrium con- 
ditions as, e.g., inside a hydrothermal pore system where 
polynucleotides are strongly accumulated by a combina- 
tion of convective flow and thermophoresis [10 . At any 
point in time, the reaction volume contains various se- 
quences Sl of length L. The full time evolution of this 
pool is a stochastic process with the reactions 
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We assume a constant and unbiased influx of monomers 
(ACGU) at rate J. The effective outflux rate (11 = 
dQe'^^/^"^^'^ accounts for the strong accumulation of nu- 
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FIG. 1: (Color online) Illustration of the RNA reactor. Left: 
Combined action of convection and thermophoresis in narrow 
pores subject to a temperature gradient results in strong ac- 
cumulation of nucleotides, as indicated by the darker shading. 
Right: The region of high concentration defines an open re- 
action volume where nucleotides enter and bonds are formed 
through ligation reactions. Equilibrium base pair formation 
protects bonds next to paired nucleotides (dark) from cleav- 
age. Length-dependent outflux accounts for the preferential 
accumulation of long molecules. 
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cleotides in a pore system, with a characteristic length 
dependence determined by the length scale Lc, which 
comprises parameters such as Soret coefRcient, tempera- 
ture gradient, and geometry Ligation of monomers 
or oligomers occurs at fixed rate a [T^] . Finally, the most 
essential ingredient is a backbone cleavage process with 
a rate that depends on the base-pairing probability of 
the neighboring bases, such that double-stranded RNA 
is more stable than single-stranded RNA. Specifically, 
we calculate the cleavage rate (5l,k = /^o(l — Pl,k) at 
backbone bond K using the average base-pairing proba- 
bility pl,k of the two neighboring bases. We allow both 
intra-molecular base pairs within single sequences and 
inter-molecular base pairs within duplexes of any two 
molecules. RNA folding is performed by means of the 
Vienna package [131 [H] , where the partition function of 
the entire ensemble is calculated assuming chemical equi- 
librium |15j . warranted by the fast hybridization kinet- 
ics 'S]. 

We use the standard Gillespie algorithm to simulate 
the stochastic dynamics ([l]) of the sequence pool. The 
cleavage rate /3l,k, which is recalculated from the folding 
output for all molecules whenever necessary, effectively 
introduces a selection for base-pair formation. Since 
RNA folding depends on the temperature T and du- 
plex formation is also concentration-dependent, we can 
vary the selection pressure via pL,KiT,V). We consider 
the reactions ([T]) under different possible conditions, with 
two different temperatures (a cold system at 10° C and a 
hot environment at 60° C) and concentrations (in the pM 
and mM range, respectively). To study the differences 
to a random pool, we also consider a "neutral" scenario 
without folding {pl,k — 0). These scenarios are chosen 
mainly to highlight the effects of base pairing and not to 
suggest specific environmental conditions at the origin of 
life. 

Stationary length and shape distribution. — Disre- 
garding sequence-dependent selection, the ligation- 
cleavage dynamics of the RNA reactor resembles the ki- 
netics of cluster aggregation and fragmentation. Hence, 
the stationary sequence length distribution shown in 
Fig. [2ja) corresponds to a cluster size distribution, and 
its moments can be obtained using established meth- 
ods [HI [TB] . In the limit of large influx J, the average 
total molecule number (Ntot) and their mean length (L) 
are given by: 
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where we have neglected the length dependence of the 
outfiux (Lc — > oo; a finite value for shifts both (Ntot) 
and (L) to larger values without strongly affecting the 
shape of the distribution) . These analytical results read- 
ily explain why with stronger selection the total number 
of molecules decreases, but their mean length goes up (see 
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FIG. 2; (Color online) Steady-state properties of the sequence 
pool; (a) length distribution (Nl), (b) total number (Ntot) 
of molecules, and (c) their mean length (L). (d) base pairing 
probability (Pl) averaged over sequences of length L, with 
mean (p) shown in (e). (f) structural repertoire of long se- 
quences: steady-state probabilities for sequences longer than 
L* = 35 (shaded parts: L* = 50), which fold into a structure 
of similar shape as the indicated schematic drawings. Selec- 
tion strength increases from light to dark color as indicated in 
the legend. All observables are averaged over time and 10 in- 
dependent replicas. Remaining parameter values were J — 1, 
a = .001, Po = .01, do = .005, Lc = 10. 



Fig. [2][b) and (c)): the cleavage rate /3o is reduced as the 
mean base pairing probability (Pl) is increased especially 
for longer sequences (cf. Fig. [2jd)), and the distribution 
thus gains more weight in the tail of long sequences. 

In order to characterize the structural repertoire of this 
RNA pool, we focused on the tail of the length distribu- 
tion and analyzed the secondary structures of long se- 
quences with L > L*. We performed the analysis for 
L* = 35 as well as L* = 50 (the length of the minimal 
hairpin ribozyme |17)). Fig. [2jf) shows the probability 
to observe structures within basic "shape" classes |18j . 
such as hairpins or hammerheads [19J. We observe a sig- 
nificant enrichment of complex structures under selection 
compared to the neutral case defined above. 

Information transmission via hybridization. — Base 
pairing and the ensuing correlations between sequences 
occur mostly within relatively short sequence regions. 
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Therefore, we focus on the dynamics of shorter subse- 
quences or "sequence motifs" of length ^, which are in- 
formational entities not tied to a specific molecule. From 
our simulations, we extract time trajectories for the copy 
numbers ni{t) of all 4^ different motifs. Even for fairly 
small ^ > 3, the sequence space of motifs is not fully cov- 
ered in the finite ensemble, i.e., an average motif copy 
number is typically {ni{t)) ^ 1. Hence, signatures of in- 
formation transmission should appear as an unexpected 
increase in the lifetime of these motifs. Suitably aver- 
aged observables are provided by the auto- and cross- 
correlation functions, Ca(t) — ^ - (ni(t)ni(O)) and 
Cc{t) = 4~^^- (ni(<)n*(0)), respectively, where n* is 
the copy number of a motif's (reverse) complement |19j . 
Fig. [3]Ja) and (b) show data for these correlation func- 
tions for £ = 6 and the parameter set used in Fig. [2] 

The observed motif correlations can be understood in 
the framework of a simple stochastic process. Motifs are 
created when sequence ends are ligated together and de- 
stroyed by cleavage [5^. Using a mean- field-type ap- 
proach, we pick an arbitrary probe motif with copy num- 
ber n{t). Its dynamics is described by a birth-death pro- 
cess, where n{t) is increased with constant rate and 
decreased with linear rate see schema (i) in Fig.[3]jc). 
The birth rate can be computed from the steady-state 
length distribution (iV^) by counting how many ends of 
long enough molecules are available for ligation. Assum- 
ing an annealed random ensemble, we obtain 
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The death rate k- comprises the effects of cleavage and 
hybridization. A motif is cleaved with rate /3o at any 
of its £ — 1 bonds, but this rate is reduced by the effec- 
tive base pairing probability of its parent sequence, which 
in turn depends on the selection strength. On average, 
this reduction follows from averaging over the length and 
base-pairing probability distributions (Nl) and (Pl) of 
parent sequences, respectively. This gives the result 
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However, a birth-death process based on these two ef- 
fective rates alone necessarily fails to describe cross- 
correlations between a motif and its complement |21j . 
The reduction in the cleavage rate of a particular motif 
due to hybridization is conditional on the presence of its 
complementary partner. Hence, we modulate the aver- 
age death rate fc_ with an additional factor h(x) < 1, 
which accounts for the probability of hybridization and 
depends on the number x = n*/n of available comple- 
ments per motif. Since the average hybridization prob- 
ability is small under the conditions considered here, it 
will be proportional to x. This leads us to a linear ansatz 




FIG. 3: (Color online) Information transmission among se- 
quence motifs, (a) and (b) auto- and cross-correlation func- 
tions Ca/c(i) from simulation data for £ = 6 (dots) together 
with analytical expressions from Eq. ([6| (solid lines). The 
rates fc_ and fc+ have been computed from Eqs. ([3| and Q, 
with r as only fit parameter, (c) schemata for different birth- 
death processes: (i) motifs are created with constant rate fc+ 
and destroyed with linear rate k-h(n* /n), which is reduced by 
hybridization to their complements; (ii) motifs are destroyed 
with fixed rate but are copied from their complements 
with rate r. To leading order in r/k-, both processes give 
rise to identical correlation functions Ca/c(i), where a non- 
constant Cc(i) signifies information transmission between a 
motif and its complement, (d) dependence of the replication 
efficiency r/k- on the cleavage rate /3o (error bars indicate 
95% confidence intervals). Color code as in Fig. [5] 



h{x) w f — {r/k^)x, where the significance of the coefR- 
cient r will shortly become apparent. We find that in the 
"hybridization process" of Fig. [sjc) , the expected copy 
number (n) of a motif obeys 
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A symmetric equation holds for (n*). Strikingly, this 
result is identical to the corresponding rate equations 
for a "replication process" |14j . where motifs are born 
with rate destroyed with fixed rate and copied 
from their complements with rate r, as in schema (ii) of 
Fig. [3]jc). This observation suggests that we may inter- 
pret the coefficient r as an apparent replication rate for 
motifs in the RNA reactor. 

To validate this interpretation, and to measure the ap- 
parent replication rate in our simulations, we calculate 
the correlation functions of the hybridization process us- 
ing the same approximation for h{x) |I4I . yielding 
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In Fig. [sj^a) and (b), we used these expressions with the 
rates fc+ and k^ calculated from Eqs. ([3| and Q, and 



with r as only free parameter fitted simultaneously to 
both datasets. The equivalence between the hybridiza- 
tion and the replication processes is also exhibited by 
their correlation functions to leading order in r/k- |14j . 
Hence, the good agreement with the simulation data in- 
dicates that the observed motif correlations are virtu- 
ally indistinguishable from those expected for inefficient 
template-directed replication. The replication efhciency 
r/k_ determined by the fits is plotted in Fig. [Sjjd) as 
function of the bare cleavage rate /3o for the different 
conditions. Remarkably, it reaches levels close to 30 % 
in the cold and highly concentrated environment, where 
base pairing via duplex formation is favorable. Note that 
a true (exponential) replicator would require that motifs 
are copied faster than they are degraded (r > while 
our system with r < fc_ is an inefficient realization. 

These findings show that protection against cleavage 
due to folding and hybridization leads to an extended 
sequence memory in the RNA reactor. One global con- 
tribution to this longer motif lifetime is due to the "pro- 
tection factor" in square brackets in Eq. Q , which renor- 
malizes the bare cleavage rate to account for the average 
probability that a motif is paired. Another contribu- 
tion stems from the correlation time in Eq. Q, which is 
increased as the apparent replication rate is subtracted 
from the renormalized cleavage rate, such that Ca/c(t) 
decays on time scales of order — r)"^. This specific in- 
crease occurs only when a motif and its complement mu- 
tually protect each other, and it therefore demonstrates 
the emergence of information transmission. 

Conclusions. — We have analyzed stochastic simula- 
tions of a minimal prebiotic RNA reactor, where for- 
mation of double strands protects sequence parts from 
degradation. On the one hand, this selection for structure 
biases the resulting pool towards longer and more struc- 
tured sequences, favoring the emergence of ribozymes. 
On the other hand, it leads to a weak apparent repli- 
cation process based on "information transmission by 
hybridization" , conceptually similar to "sequencing-by- 
hybridization" techniques [H]. Together, the structural 
complexity and the information transmission featured in 
the RNA reactor suggests this type of system as plausible 
intermediate for the emergence of a true replicator with 
r > k-. For instance, some of the relatively frequent 
simple structures observed in our simulation are similar 
to known ligase ribozymes [3] . This functionality in turn 
would facilitate the creation of more complex molecules 
from essential modular subunits |23j . Once ribozymes 
emerge, a self-replicating system could be established 
by template-directed ligation of suitably complementary 
oligomers [1]. So far, it remained unclear how such auto- 
catalytic RNA systems would be supplied with appro- 
priate oligomer substrates. However, the strong cross- 
correlations observed in the RNA reactor demonstrate a 
significantly enhanced chance of finding sequences com- 
plementary to those present in the pool, including the 



sequence to be replicated. Thus, the RNA reactor acts 
as an adaptive filter to preferentially keep potentially use- 
ful substrate sequences. This adaptive selectivity would 
allow for the "heritable" propagation of small variations 
and thus endow the replicator with basic evolutionary 
potential. 
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I. SUPPLEMENTARY FIGURES 




FIG. SI. GU pairs. Properties of the steady-state ensemble as in Fig. (1) of the main text, but in 
a simulation including GU wobble pairs, (a) length distribution {Nl), (b) total number (Ntot) of 
molecules, and (c) their mean length (L). (d) base pairing probability {pl) averaged over sequences 
of length L, with mean (p) shown in (e). (f) Structural repertoire of long sequences. While the 
differences to the results without GU pairs shown in Fig. 2 in the main text are comparably small, 
we observe that this additional pairing mode provides additional stability especially for longer RNA 
and thus further increases the chances of finding structured molecules in random pools. 
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FIG. S2. Shorter motifs. Dependence of the rephcation efficiency r/k^ on the bare cleavage rate 
/3o as in Fig. (3c) in the main text, but for shorter motifs of length £ = 4 (a) and ^ = 5 (b). 
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FIG. S3. Analysis of the influence of self-complementary sequences. Self-complementary sequences 
in the pool give rise to different motif statistics. To test this effect, we ran control simulations with- 
out RNA folding but fixed sequence- independent base pairing probabilities (p^) chosen from the 
distributions measured in the full simulation (cf. Fig. 1(d) in the main text). This leads to almost 
identical length statistics in the sequence ensemble, but motif correlations due to hybridization are 
absent. Shown is the dependence of the apparent replication efficiency r/A;_ on the bare cleavage 
rate /3o as in Fig. (3c) in the main text. Self-complementarity gives rise to subdominant cross- 
correlations resulting in small non-zero values for r largely independent of the "selection strength" 
(note the different scale on the ordinate). 
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II. IMPLEMENTATION DETAILS: CALCULATION OF BASE PAIRING PROB- 
ABILITIES 



Our code is based on the Gillespie algorithm [1] for the stochastic simulation of chemical 
reactions. At each time step, we compute the propensities for each of the four possible 
reactions involving sequences SL,i of length that are present in Nl^i copies: 

1. influx of a monomer with propensity J (monomers are chosen randomly among the 
four nucleotides A,C,G, and U); 

2. outflux of one of N^^i copies of SL,i with propensity doN^ie'^^^^^"; 

3. ligation of two sequences S^i and Slj to a combined sequence of length Li+Lj < Lj^ax 
with propensity aN^i{N^j - 5ij). 

4. cleavage of one of NL,i copies of a sequence SL,i at position K with propensity NL^if^L.^K 
where = /3o(l - PL„i^(r, 1/)). 

One event is randomly chosen according to its relative propensity, and time is updated by 
a time interval drawn from an exponential distribution with a mean equal to the inverse of 
the sum of the propensities. 

The first two steps are straightforwardly implemented, but some explanations on the lat- 
ter two are in order. Firstly, we neglect a possible length dependence of the ligation reaction, 
which is poorly understood on a microscopic level, but probably rather weak [2]. Also, we 
scale out its volume dependence to facilitate comparison of different scenarios, which operate 
at different concentrations. Finally, we restrict ligation to sequences with combined length 
smaller than Lmax = 100 to limit computationally expensive RNA folding. Secondly, the 
cleavage reaction involves the sequence-specific, temperature- and concentration-dependent 
probabihty pii^xiT, V) that the nucleotides next to bond K are paired. The calculation is 
done by means of the Vienna package for RNA secondary structure folding [3]. We allow 
both intra- and intermolecular base pairs in complexes involving at most two sequences. To 
simplify the following argument, we omit the length index on the sequences S^. For each 
sequence Si, we calculate the simplex partition sum Zj for all possible secondary structures 
of that sequence, and the corresponding duplex partition sums Zij that result from folding 
a duplex involving two molecules Si and Sj. Note that duplex formation is concentration 
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dependent, and we therefore need to calculate the partition sum Z of the ensemble of se- 
quences [4-6]. If each sequence is initially present in copies, and the ensemble after 
hybridization will contain rij simplex structures and riij duplex structures, this partition 
sum can be written as: 



under the mass conservation constraint that each sequence be part of at most one complex 
at the same time: 

Ui + Inn + XI = n°. (2) 

The chemical equilibrium is obtained by minimizing the ensemble free energy T — —k^T In Z 
with respect to the variables and riy, under the constraint Eq. (2). Even though in our 
relatively small system these variables are all small numbers, we can efficiently perform 
this calculation only in the thermodynamic limit, assuming rapid chemical equilibration 
due to the very fast hybridization kinetics [7] and the convective flow cycles encountered 
in the thermal trap. Hence, we switch to concentration variables q = nijV in a volume V 
(correspondingly for and Qj). 

Following Ref. [6], we now introduce Lagrange multipliers \ (which are chemical poten- 
tials measured in units of k-^T\ and minimize L — T jk^T -|- Aj(c^ — Q — Icn — X^j^j 
instead. Using Stirling's formula, this requires finding the minimum of 



£(c. A) = X [c°(l - In c° + A,) - q(1 - In q + In + A^) 

i 

- X Cjj (1 - In Cij In Zij Aj Aj 



(3) 



The minimum is given by 
where the stationary values A* for the chemical potentials are obtained from minimizing 



dl^Z^, 4 = %e^*+^l, (4) 



(5) 
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Even though this lower-dimensional problem is in principle not ill-conditioned [6], the min- 
imization becomes numerically unstable for large systems on the order of 100 molecules 
with possibly very different hybridization energies. A stable code was obtained by using the 



L-BFGS library [8] implementing the limited-memory Broyden-Fletcher-Goldfarb-Shanno 
algorithm [9], to obtain equilibrium values of c* that obey the mass conservation Eq. (2) 
within a relativ error of at most 10~^. 



Si involves the probabilities Pi,K,j,K' that nucleotide K is paired with another nucleotide at 
position K' of sequence Sj, and therefore the probability Cij/c^ that sequence Si is actually 
part of the corresponding duplex: 



The partition sum Zj of a sequence, and the duplex partition sums Zij and corresponding 
base pairing probabilities Pi^K,j,K' with all other sequences, are computed only once during 
the simulation, namely in the instant a sequence appears for the first time. For the computa- 
tion of the effective cleavage rate /^l^.k = /3o(l ^ PLi,K), only the equilibrium concentrations 
Ci and Cij need to be adjusted every time the sequence ensemble is modified. For this step, 
we only consider events involving sequences large enough to actually fold, i.e., we neglect 
the influence of changes in mono- and dinucleotide concentration. 

Note that our scenarios operate at vastly different temperatures, which gives reason to 
question the quantitative accuracy of the RNA folding results. While the primary tempera- 
ture dependence in the Boltzmann factors is correctly accounted for, the indirect dependence 
of the energy parameters used in the algorithm is captured only via a linear approximation 
around T = 37 °C. However, experimental RNA melting curves have been reproduced rea- 
sonably well over a wide range of temperatures [10], and we believe that small quantitative 
errors should not severity affect our results. 

III. DERIVATION OF THE STEADY-STATE LENGTH DISTRIBUTION 

In the absence of sequence-specific cleavage rates, the sequence length distribution is 
identical to the cluster size distribution obtained in a simple aggregation-fragmentation 
process with a mass-independent aggregation rate a and a fragmentation rate (3L that is 
proportional to cluster size L, with random binary breakage. As a variation on the standard 
problems discussed in the literature, we also include monomer influx with rate J and a 
length-dependent outflux di- For our parameter regimes, we expect that the aggregation- 



The probability PLi,K 



\ \Pi,K+Pi,K+i\ that enters the cleavage rate of bond K of sequence 
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fragmentation dynamics results in a nonequilibrium steady state length distribution A^^ 
(here we omit the angle brackets) . It is obtained as the stationary solution of the following 
mass-balance equation: 



L-l 



K=l K=l K=L+1 

(7) 

The first term on the right hand side describes the creation of a sequence of length L from 
two fragments of sizes K and L — K, while the next term models the ligation of sequence 
Sl to any other sequence (the factor of 2 accounts for the correct counting of same-mass 
clusters). The third term corresponds to the breakage of sequence Sl at any of its L — 1 
bonds, and the next term the production of a sequence of length L as one of the two cleavage 
fragments of a longer sequence. The fifth and sixth terms, respectively, are monomer influx 
and the length-dependent outflux, where the square-root dependence in the exponential 
stems from the specific thermodiffusive behavior of polynucleotides in a thermal trap, and 
the crossover scale combines parameters such as the Soret coefficient, the trap geometry 
and the temperature difference across the trap. Because this complicated length dependence 
inhibits further analysis, we will set Lc ^ oo in the following. 

To proceed with the analysis, we perform the continuum limit Nl — > N{L) in the rate 
equations Eq. (7): 

rL poo 

N{L) dK N{K)N{L - K) - 2aN{L) / dK N{K) - pLN{L) 

.oo (8) 
+ 2/3 J dKN(K) + J5(L) - doN(L). 



Now we introduce the moments 

Mr, 



/•oo 

/ dLVNL, (9) 
Jo 



where Mq — (A^tot) is the number of molecules. Mi — (L) (A^tot) is the total mass, and so 
forth. The rate equations for the moments are given by 

Mn^aJ2 Q MkMn-k - 2aMoMn + f3 (^^^ - 1^ M^+i + Jo - doM^. (10) 

Even though the hierarchy of rate equations for the moments is not closed due to the 



fragmentation term, the equations for the first two moments decouple from the rest: 

Mo = -aM^ + pMi + J - doMo (11) 
Ml = J - doMi. (12) 

Hence, we can easily obtain stationary solutions for the total number of molecules (A^tot) 
and their mean length (L) : 



{N,ot) = \l ^ 2^"V^Z^ .iJ»dJ{a{f3 + do)), (13) 



^io (A^tot) ^ V W + cio) ■ ^^^^ 

Numerical studies indicate that the resulting length distribution is very similar to a F- 
distribution, which can be used in a moment closure approximation to compute higher 
moments [11]. For our parameter regime, the distribution is in fact close to exponential 
((AL^) « {Lf). 

We find that the thermal trap, through an outflux rate that drops with the exponential 
of the square root of sequence length, serves mainly to shift the distribution towards longer 
sequences. It does not, however, significantly affect the shape of tail of the distribution, 
because the dynamics of the longer molecules is mostly determined by their cleavage rate, 
which scales linearly with sequence length and thus quickly beats the outflux. In our simu- 
lations we kept Lc = 10 finite, because thermophoretic accumulation is essential to obtain 
nucleotides at reasonably high concentration in an experimental system. The above analysis 
suggests that the precise value of Lc does not significantly affect our results. 

IV. DERIVATION OF THE AUTO- AND CROSS-CORRELATION FUNCTION 

The master equation for the production and destruction of motifs of length i and their 
complements is given by: 

dtPn,n* = k+\pn-l,n* + Pn,n'-l] + k-[{n + l)/l(^)p„+i,„. + (n* + l)/l(-^)p„,n*+l] 

where n and n* are the copy number of a motif and its complement, respectively, k+ and 
k- are its birth and death rates, and h{x) is the "hybridization function", which describes 
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the decrease in the death rate of a motif in terms of the probabihty of hybridization, which 
in turn is proportional to the number x — n* /n of complements per motif that are available 
for base pairing. 

The dynamics of the mean {n{t)) = 'npn,n*{'t) follows as 

dt (n) =k+-k- {nh{n*/n)) . (16) 

Assuming that h{x) decreases only slowly from unity due to a small hybridization probability, 
we write 

M^)^1-|/.'(0)|^, (17) 

which gives 

dt (n) ^k+-k_ (n) + r (n*) , (18) 

where r = k- \h'{0)\ is the apparent replication rate. Note that Eq. (15) is symmetric with 
respect to n and n*, and we can therefore directly infer the corresponding equation for {n*). 
Conditional on the initial conditions (n(0)) = no and (n*(0)) — n^, the solution of these two 
equations reads: 

(^(^))no,nS = ^ (1 - e-^'-^^O + ^(^0 - n*)e-(^-+'-)* + ^(no + n*)e-(^-'-)*. (19) 
The correlation functions Ca(t) and Cc{t) are defined as 

a(t) = (n(t)n(O)) = 5^ no (^(t))^^,^. p"^,^., (20a) 

CM = {n{t)n%0)) =J2nl {n{t))^^,^.pl,^., (20b) 

where p° is the steady-state solution of Eq. (15). All we actually need are the three steady- 
state averages (no) = (ng), (nl) — {rf^) and (nong), which are obtained from Eq. (15) by 
expanding the hybridization function as in Eq. (17): 

<»o) = W> = -j^^ (21) 

Evaluating Eq. (20) gives Eq. (6) in the main text. 
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For a scenario with actual replication according to the reaction n ^ n + 1, the master 
equation reads: 

+ Pn,n*-l] + k- [{n + l)Pn+l,n* + {n* + l)p„,„*+l] 

(24) 

+ r[n*pn-i,n* + npn,n*-i] " [2A;+ + {k- + r)(n + 
It is easy to check that this equation gives rise to the same expression Eq. (19) for (n(t)) as 
Eq. (15). However, the stationary second moments (uq) and (nonl) are slightly different: 

(nl) = «> = '-^''^ 'li.^''-'' (25) 

The resulting correlation functions read: 

_ kl , fc_fc+e-(^— )* ^ fc_fc+e-(^-+-)* 
Oa/cltj - _ + 2(A;_ - r)2 * 2(A;2 - r^) " ^^'^ 

The time dependence, given through Eq. (19), is clearly the same as that of the correlation 

functions of Eq. (15), and the amplitudes are identical to those of Eq. (6) in the main text 

to first nonzero order in r: 

a(0)-a(oo) = ^ + O(r), (28) 
C,(0)-C,(oo) = ^ + O(r^). (29) 
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